R packages: mlbench, caret
Dataset: PimaIndiansDiabetes2 from mlbench

This week, we will learn how to do classification with four algorithms: logistic regression, LDA, kNN and SVMs.
The dataset we will be using is called PimaIndiansDiabetes2 and it is included as part of the mlbench package. You will first need to install the mlbench package. To load the data, do the following:
library(mlbench)
library(tidyverse) # provides drop_na(), select() and %>% used below
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
data(PimaIndiansDiabetes2, package = "mlbench")
Inspect PimaIndiansDiabetes2 and verify its structure. It should have 9 columns: 8 numeric features and a single class label. If any missing data is observed, discard those rows and use only the complete cases.
Solution
dim(PimaIndiansDiabetes2)
## [1] 768 9
head(PimaIndiansDiabetes2)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
vapply(PimaIndiansDiabetes2, class, character(1))
## pregnant glucose pressure triceps insulin mass pedigree age
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## diabetes
## "factor"
vapply(PimaIndiansDiabetes2, anyNA, logical(1))
## pregnant glucose pressure triceps insulin mass pedigree age
## FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## diabetes
## FALSE
complete.pimas <- PimaIndiansDiabetes2 %>% drop_na()
Either from the printed table above or, more explicitly, by applying `class` to each column of the data frame, we can see that the first 8 columns are numerically coded and the 9th is a factor variable labelled `diabetes`.
Use glm() to perform logistic regression to classify observations as positive or negative for diabetes.
In particular, determine which of the features seem the most informative to explain the diabetes class.
Solution
logit.model <- glm(diabetes ~ ., data = PimaIndiansDiabetes2,
family = binomial(link = 'logit'))
summary(logit.model)
##
## Call:
## glm(formula = diabetes ~ ., family = binomial(link = "logit"),
## data = PimaIndiansDiabetes2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -0.6603 -0.3642 0.6409 2.5612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
## pregnant 8.216e-02 5.543e-02 1.482 0.13825
## glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
## pressure -1.420e-03 1.183e-02 -0.120 0.90446
## triceps 1.122e-02 1.708e-02 0.657 0.51128
## insulin -8.253e-04 1.306e-03 -0.632 0.52757
## mass 7.054e-02 2.734e-02 2.580 0.00989 **
## pedigree 1.141e+00 4.274e-01 2.669 0.00760 **
## age 3.395e-02 1.838e-02 1.847 0.06474 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 344.02 on 383 degrees of freedom
## (376 observations deleted due to missingness)
## AIC: 362.02
##
## Number of Fisher Scoring iterations: 5
Loading `caret` above (and `e1071` later) prepares the session for the classification analysis in the subsequent subquestions. After fitting the logistic regression with a call to `glm`, inspecting the fitted model reveals that the most significant variable is `glucose` (ignoring the intercept). As an additional check, fit a separate simple logistic regression on each predictor:
predictors <- names(PimaIndiansDiabetes2)[-9]
sapply(predictors,
function(x) summary(glm(as.formula(paste0("diabetes ~ ", x)),
family = binomial(link = "logit"),
data = PimaIndiansDiabetes2))$coefficients[2, 4])
## pregnant glucose pressure triceps insulin mass
## 2.147445e-09 2.985479e-33 5.718197e-06 8.023829e-09 7.166747e-08 4.309761e-16
## pedigree age
## 3.702926e-06 1.773155e-10
The above code extracts the p-values for the predictors, each fitted in a simple logistic regression with that single predictor. We can see that `glucose` has a p-value that is zero to 32 decimal places.
Compute the accuracy of the logistic regression classifier across the entire training data set.
Solution
logit.decision <- ifelse(logit.model$fitted.values > 0.5, "pos", "neg")
# glm() dropped the incomplete rows, so the fitted values align with the complete cases
complete.pima <- PimaIndiansDiabetes2 %>% drop_na()
logit.accuracy <- mean(logit.decision == complete.pima$diabetes) * 100
logit.accuracy
## [1] 78.31633
Using the complete cases, the training accuracy is approximately 78.3%.
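The single accuracy figure can also be unpacked into a confusion table. As a base-R sketch using toy label vectors (illustration only, not the fitted model above), the 2x2 table and the accuracy recovered from its diagonal look like this:

```r
# Toy predicted and true labels (illustration only)
truth <- factor(c("pos", "neg", "neg", "pos", "neg", "neg"), levels = c("neg", "pos"))
pred  <- factor(c("pos", "neg", "pos", "pos", "neg", "neg"), levels = c("neg", "pos"))

# Cross-tabulate predictions against the truth
conf <- table(Predicted = pred, Truth = truth)

# Accuracy is the proportion of counts on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy # 5 of 6 correct
```

caret's confusionMatrix() (used later) reports this same table together with sensitivity, specificity and related metrics.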
Install and load the caret package; it will be used for the rest of this question.
Partition the PimaIndiansDiabetes2 dataset into 75% training and 25% test.
Solution
set.seed(123)
inTrain <- createDataPartition(complete.pima$diabetes, p = .75)[[1]]
pimatrain <- complete.pima[ inTrain, ]
pimatest <- complete.pima[-inTrain, ]
head(inTrain)
## [1] 1 2 3 4 5 6
nrow(pimatrain)
## [1] 295
nrow(pimatest)
## [1] 97
# Alternatively using only base R commands
n.dat <- nrow(complete.pima)
training <- sample.int(n.dat, size = floor(n.dat * 75/100))
sets <- rep("test", n.dat)
sets[training] <- "train"
sets
## [1] "test" "train" "test" "train" "test" "train" "train" "train" "train"
## [10] "train" "train" "train" "test" "train" "train" "train" "train" "test"
## [19] "train" "train" "train" "test" "train" "train" "train" "train" "test"
## [28] "train" "train" "train" "train" "train" "train" "test" "train" "train"
## [37] "train" "train" "train" "test" "train" "test" "test" "test" "train"
## [46] "train" "train" "train" "train" "train" "test" "train" "train" "train"
## [55] "train" "test" "train" "test" "train" "train" "train" "test" "train"
## [64] "train" "train" "train" "train" "test" "test" "train" "train" "train"
## [73] "train" "train" "train" "train" "test" "train" "train" "train" "train"
## [82] "test" "train" "train" "test" "train" "train" "train" "test" "test"
## [91] "train" "test" "train" "train" "train" "train" "test" "test" "test"
## [100] "train" "train" "train" "train" "test" "train" "train" "train" "train"
## [109] "train" "train" "train" "train" "train" "train" "train" "test" "train"
## [118] "train" "train" "train" "train" "train" "train" "train" "train" "test"
## [127] "train" "train" "train" "train" "train" "train" "test" "train" "train"
## [136] "train" "train" "train" "train" "train" "train" "train" "test" "train"
## [145] "train" "train" "test" "train" "test" "train" "train" "test" "train"
## [154] "train" "train" "train" "train" "train" "train" "train" "train" "train"
## [163] "train" "test" "train" "train" "train" "train" "train" "train" "test"
## [172] "train" "test" "train" "train" "train" "train" "train" "train" "train"
## [181] "test" "train" "train" "train" "train" "train" "test" "train" "train"
## [190] "train" "train" "test" "train" "train" "test" "test" "train" "test"
## [199] "train" "train" "train" "train" "train" "train" "train" "train" "train"
## [208] "train" "train" "test" "train" "test" "train" "train" "test" "train"
## [217] "train" "train" "train" "train" "test" "train" "train" "train" "test"
## [226] "train" "train" "train" "test" "train" "test" "train" "train" "train"
## [235] "test" "test" "train" "train" "train" "train" "train" "train" "train"
## [244] "train" "train" "test" "test" "train" "test" "train" "train" "test"
## [253] "train" "train" "test" "train" "test" "test" "train" "train" "train"
## [262] "train" "train" "train" "test" "test" "train" "train" "train" "test"
## [271] "test" "test" "train" "train" "train" "train" "train" "train" "train"
## [280] "train" "train" "train" "test" "test" "train" "train" "train" "train"
## [289] "train" "train" "train" "train" "train" "train" "train" "train" "train"
## [298] "train" "train" "test" "train" "test" "test" "test" "train" "train"
## [307] "train" "train" "train" "train" "train" "test" "train" "train" "train"
## [316] "train" "train" "train" "train" "train" "train" "train" "train" "train"
## [325] "test" "train" "test" "train" "test" "train" "train" "test" "train"
## [334] "train" "test" "train" "train" "train" "train" "test" "test" "test"
## [343] "test" "test" "train" "test" "train" "train" "train" "train" "test"
## [352] "test" "test" "train" "train" "train" "train" "train" "train" "test"
## [361] "test" "train" "train" "train" "train" "test" "train" "train" "train"
## [370] "train" "train" "train" "train" "test" "train" "test" "train" "train"
## [379] "test" "test" "train" "test" "test" "train" "test" "train" "train"
## [388] "test" "test" "train" "train" "train"
training.test.split <- split(complete.pima, sets)
vapply(training.test.split, nrow, numeric(1L))/n.dat * 100
## test train
## 25 75
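One caveat about the base-R alternative: sample.int() ignores the class labels, whereas createDataPartition() samples within each class so the neg/pos proportions are preserved in both partitions. A minimal base-R sketch of that stratified idea, on a toy label vector:

```r
set.seed(123)
labels <- factor(rep(c("neg", "pos"), times = c(80, 40)))

# Sample 75% of the indices within each class separately
train.idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, size = floor(0.75 * length(idx)))))

# Each class contributes exactly 75% of its rows to the training set
table(labels[train.idx]) # neg: 60, pos: 30
```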
Using the training dataset and all the given features, train four classifiers (logistic regression, LDA, kNN and SVM). This can be done with the caret package by selecting the appropriate method argument in the train function. For the SVM, consider using method = "svmLinearWeights". A full list of supported methods is given here. Compute the accuracy of each classifier on the training dataset.
Solution
# Logistic regression
logit.model <- train(diabetes ~ . , data = pimatrain, method = "glm",
family = binomial(link = 'logit'),
trControl = trainControl(method = "repeatedcv", repeats = 5))
logit.model
## Generalized Linear Model
##
## 295 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 265, 265, 265, 266, 266, 266, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7892496 0.4981545
# LDA
lda.model <- train(diabetes ~ . , data = pimatrain, method = "lda",
trControl = trainControl(method = "repeatedcv", repeats = 5))
lda.model
## Linear Discriminant Analysis
##
## 295 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 265, 267, 265, 265, 266, 266, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7824335 0.4856005
# knn
knn.model <- train(diabetes ~ . , data = pimatrain, method = "knn",
trControl = trainControl(method = "repeatedcv", repeats = 5))
knn.model
## k-Nearest Neighbors
##
## 295 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 266, 265, 265, 266, 266, 266, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7338949 0.3650104
## 7 0.7501511 0.3889205
## 9 0.7602660 0.4171574
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# svm
svm.model <- train(diabetes ~ ., data = pimatrain, method = "svmLinearWeights",
trControl = trainControl(method = "repeatedcv", repeats = 5))
svm.model
## Linear Support Vector Machines with Class Weights
##
## 295 samples
## 8 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 266, 265, 265, 265, 265, 266, ...
## Resampling results across tuning parameters:
##
## cost weight Accuracy Kappa
## 0.25 1 0.7847373 0.4903511
## 0.25 2 0.7829901 0.5308560
## 0.25 3 0.7448456 0.4800309
## 0.50 1 0.7868292 0.4985819
## 0.50 2 0.7817504 0.5279283
## 0.50 3 0.7496043 0.4872269
## 1.00 1 0.7855452 0.4952968
## 1.00 2 0.7811544 0.5280862
## 1.00 3 0.7502709 0.4885899
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.5 and weight = 1.
# Alternatively, fit all four models with a single lapply call
tr.control <- trainControl(method = "repeatedcv", repeats = 5)
class.methods <- c("glm", "lda", "knn", "svmLinearWeights")
fitted.models <- lapply(class.methods, function(m) {
if (m != "glm")
train(diabetes ~ ., data = pimatrain, method = m, trControl = tr.control)
else
train(diabetes ~ ., data = pimatrain, method = m, trControl = tr.control,
family = binomial(link = "logit"))
})
names(fitted.models) <- class.methods
From the training data, the SVM, LDA and logistic models give the best cross-validated accuracy, at around 78-79%, while kNN lags slightly behind at 76%.
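The training accuracies quoted above can also be read off programmatically: each object returned by train() stores its resampling summary in a results data frame. The sketch below mimics that structure with mock objects (the numbers are illustrative, rounded from the output above) rather than refitting the models:

```r
# Mock objects imitating the `results` slot of caret::train fits (illustration only)
fitted.models <- list(
  glm = list(results = data.frame(Accuracy = 0.789)),
  lda = list(results = data.frame(Accuracy = 0.782)),
  knn = list(results = data.frame(k = c(5, 7, 9),
                                  Accuracy = c(0.734, 0.750, 0.760)))
)

# Best cross-validated accuracy per method
best.acc <- vapply(fitted.models, function(m) max(m$results$Accuracy), numeric(1))
best.acc
```

The same vapply() call applied to the real fitted.models list from the lapply() above recovers the figures quoted.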
Using the trained classification models, classify the test set data and compare their test set accuracies.
Solution
# Individual accuracies can be computed as follows, e.g. for logistic regression
logit.pred <- predict(logit.model, newdata = pimatest)
# use the helper function from caret that computes the classifier metrics
logit.confusion <- confusionMatrix(logit.pred, pimatest$diabetes)
# Alternatively, compute the metrics for all four models at once
metrics <- lapply(fitted.models, function(mod) {
predictions <- predict(mod, newdata = pimatest)
confusionMatrix(predictions, pimatest$diabetes)
})
names(metrics) <- c("Logistic", "LDA", "kNN", "SVM")
overall <- vapply(metrics, "[[", numeric(7), what = "overall")
overall
## Logistic LDA kNN SVM
## Accuracy 0.74226804 0.75257732 0.6804124 0.74226804
## Kappa 0.41240611 0.44038462 0.2595420 0.41240611
## AccuracyLower 0.64346836 0.65458083 0.5779888 0.64346836
## AccuracyUpper 0.82575179 0.83459178 0.7714775 0.82575179
## AccuracyNull 0.67010309 0.67010309 0.6701031 0.67010309
## AccuracyPValue 0.07813444 0.05013472 0.4618798 0.07813444
## McnemarPValue 1.00000000 1.00000000 0.7194375 1.00000000
Assessing the models on the test data instead of the training data, we see broadly similar performance: logistic regression, LDA and the SVM again achieve comparable test accuracies of around 74-75%, while the kNN model is slightly worse at 68%.
For ease of visualization, consider only two features as predictors. Construct kNN, logistic regression and SVM models to classify the diabetes response based on the predictors glucose and mass.
In particular, plot the data of mass against glucose and colour the points by the diabetes labels. Then add to your plot the decision boundaries for a logistic regression and linear SVM using only the mass and glucose predictors. (Assume for logistic regression that the decision boundary is determined using a cutoff of 0.5 for the predicted probabilities).
Solution
logit.model2 <- glm(diabetes ~ glucose + mass, data = complete.pimas, family = binomial(link = "logit"))
logit.coefs <- coefficients(logit.model2)
logit.slope <- -logit.coefs["glucose"]/logit.coefs["mass"]
logit.intercept <- -logit.coefs["(Intercept)"]/logit.coefs["mass"]
# Using the same parameter search as earlier
svm.model2 <- train(diabetes ~ glucose + mass, data = pimatrain, method = "svmLinearWeights", scale = FALSE,
trControl = trainControl(method = "repeatedcv", repeats = 5))
svm.coefs.linear <- coef(svm.model2$finalModel)
svm.linear.slope <- -svm.coefs.linear["glucose"]/svm.coefs.linear["mass"]
svm.linear.intercept <- -svm.coefs.linear["(Intercept)"]/svm.coefs.linear["mass"]
cols <- ifelse(complete.pimas$diabetes == "pos", "blue", "red")
plot(mass ~ glucose, data = complete.pimas, col = cols, pch = 16)
abline(logit.intercept, logit.slope, lty = "dotted")
abline(svm.linear.intercept, svm.linear.slope)
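The slope and intercept used in the abline() calls come from setting the linear predictor to zero (the 0.5-probability contour): b0 + b_glucose * glucose + b_mass * mass = 0, hence mass = -b0/b_mass - (b_glucose/b_mass) * glucose. A quick base-R check of that algebra, with made-up coefficients rather than the fitted values:

```r
# Hypothetical coefficients, for illustration only
b <- c("(Intercept)" = -8, glucose = 0.04, mass = 0.07)
slope     <- -b["glucose"] / b["mass"]
intercept <- -b["(Intercept)"] / b["mass"]

# Any point on the line mass = intercept + slope * glucose should have
# a linear predictor of (numerically) zero, i.e. predicted probability 0.5
g   <- 120
m   <- intercept + slope * g
eta <- b["(Intercept)"] + b["glucose"] * g + b["mass"] * m
eta # effectively zero
```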
Generate the decision regions for the kNN method and for a support vector machine with a radial kernel, and comment on the differences between the resulting boundaries.
Solution
viz.knn.model <- train(diabetes ~ glucose + mass, data = pimatrain, method = "knn",
prob = TRUE,
trControl = trainControl(method = "repeatedcv", repeats = 5))
# mapping decision boundary
relevant.dat <- complete.pimas %>% dplyr::select(glucose, mass)
grids <- lapply(relevant.dat, function(x) seq(from = min(x), to = max(x), length.out = 128))
viz.dat <- expand.grid(grids)
knn.predictions <- predict(viz.knn.model, viz.dat)
svm.radial.model <- train(diabetes ~ mass + glucose, data = pimatrain, method = "svmRadialWeights",
trControl = trainControl(method = "repeatedcv", repeats = 5))
svm.predictions <- predict(svm.radial.model, viz.dat)
par(mfrow = c(1, 2))
plot(mass ~ glucose, data = complete.pimas, col = cols, pch = 16)
points(viz.dat, col = ifelse(knn.predictions == "neg", "red", "blue"), cex = 0.3, pch = 16)
plot(mass ~ glucose, data = complete.pimas, col = cols, pch = 16)
points(viz.dat, col = ifelse(svm.predictions == "neg", "red", "blue"), cex = 0.3, pch = 16)
Both decision regions are nonlinear and nonparametric. However, the kNN fit is much more volatile, since it is readily influenced by lone points. The radial kernel also adapts nonlinearly, but retains smoothness.
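That volatility is visible even in one dimension: under a 1-nearest-neighbour rule, a single stray point claims the whole neighbourhood around it. A toy base-R sketch (unrelated to the Pima fits):

```r
# Class "a" on the left, class "b" on the right,
# plus one lone "b" point sitting inside the "a" region
x <- c(1, 2, 3, 10, 11, 12, 2.5)
y <- c("a", "a", "a", "b", "b", "b", "b")

# 1-NN prediction: take the label of the closest training point
nn1 <- function(x0) y[which.min(abs(x - x0))]

nn1(2.4) # "b" - the lone point at 2.5 dominates its neighbourhood
nn1(1.9) # "a"
```

Larger k, as chosen by the cross-validation above (k = 9), averages over more neighbours and smooths out exactly this kind of artefact.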